We start by reading in the data. The programs reads in the output h5
file filtered_feature_bc_matrix.h5 of the cellranger
pipeline from 10X, returning a unique molecular identified (UMI) count
matrix. The values in this matrix represent the number of molecules for
each feature (i.e. gene; row) that are detected in each cell (column).
To improve the quality of data, we include features detected in at least
3 cells and include cells where at least 200 features are detected.
Other features and cells will be dropped. Finally, we get 18746 features
and 7530 cells.
## An object of class Seurat
## 18746 features across 7530 samples within 1 assay
## Active assay: RNA (18746 features, 0 variable features)
Before analysing the single-cell gene expression data, we must ensure that all cellular barcode data correspond to viable cells. Cell quality control is performed based on four QC covariates:
The number of unique genes detected in each cell
(nFeature_RNA). Low-quality cells or empty droplets will
often have very few genes. Cell doublets or multiplets may exhibit an
aberrantly high gene count.
The total number of genes detected within a cell (correlates
strongly with unique genes) (nCount_RNA)
The percentage of reads that map to the mitochondrial genome
(percent.mito). Low-quality / dying cells often exhibit
extensive mitochondrial contamination. We use the set of all genes
starting with MT- as a set of mitochondrial genes
The percentage of reads that map to the ribosomal genome
(percent.ribo).
The violin plot for those four QC metrics was shown below:
Then we filter cells that have unique feature counts over 7500, genes detected within a cell more than 70000. We filter cells that have >15% mitochondrial counts and >20% ribosomal counts.
Note: we observed the high percentage of the mitochondrial counts, which may indicate the potential issues in sequencing, such as the poor sample quality, leading to a high fraction of apoptotic or lysing cells.
After filtering, the number of cells and violin plot is as follows:
| sample | before_qc | after_qc |
|---|---|---|
| Total | 7530 | 2991 |
| stomach | 7530 | 2991 |
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. The figure below labled the top-10 highly variable features.
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. This step:
Here we apply a graph-based clustering approach, building upon initial strategies in (Macosko et al). Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
We first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). To cluster the cells, we next apply modularity optimization techniques(Louvain algorithm) to iteratively group cells together, with the goal of optimizing the standard modularity function.
After clustering, we get 9 clusters. Then we use UMAP to visualize and explore these clusters The goal of UMAP is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots.
The cell number of each cluster listed below:
##
## 0 1 2 3 4 5 6 7
## 882 704 585 460 209 67 46 38
Here we use violin plot to show expression probability distributions across clusters. Note that the marker ‘Car1’ was not found in the features.
The following violin plots and feature plots of three groups genes from ‘Basu_Expectation.pdf’.
The plots for Kdr gene were required in the email on December 28.
The plots for Drd1, Drd2, Drd3, Drd4 and Drd5 genes were required in the email on November 4.
The dopamine and dopamine receptors (D1, D2, D3, D4 and D5) expressed in few cells. In stomach, the Drd1 expressed in 2 cells, Drd2 expressed in 2 cells, Drd3 expressed in 1 cell, Drd4 and Drd5 have no expression. In colon, the Drd1 expressed in 1 cell, Drd2 expressed in 0 cell, Drd3 expressed in 1 cell, Drd4 expressed in 54 cells, and Drd5 have no expression. In the QC part, they were filtered. We can’t get the violin plots for them.
The plots for Th gene were required in the email on October 24.
In stomach Annotations:
Epithelial Cluster: Cluster 0
BEC Cluster: Cluster 2
LEC Cluster : Cluster 5
Epithelial Cluster: EPCAM and panCK (KRT4, KRT6, KRT7, KRT8, KRT10,
KRT17, KRT18, KRT19 and KRT20)
BEC Cluster: PECAM-1/CD31, Tie2/TEK and Kdr/FLK-1
LEC Cluster : LYVE-1, PROX1 and Flt4
To show significant results, we don’t find subclusters for cluster
LEC because of the low number of cells.